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Abstract 

Independent Component Analysis (ICA) models are very popular semiparametric 
models in which we observe independent copies of a random vector X = AS^ where A 
is a non-singular matrix and S has independent components. We propose a new way 
of estimating the unmixing matrix W = A~^ and the marginal distributions of the 
components of S using nonparametric maximum likelihood. Specifically, we study the 
projection of the empirical distribution onto the subset of ICA distributions having log- 
concave marginals. We show that, from the point of view of estimating the unmixing 
matrix, it makes no difference whether or not the log-concavity is correctly specified. 
The approach is further justified by both theoretical results and a simulation study. 
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1 Introduction 



In recent years, Independent Component Analysis (ICA) has seen an explosion in its pop- 
ularity in diverse fields such as signal processing, machine learning, and medical imaging, 
to name a fe w. For a wide-ranging list of a lgorit hms and applications of ICA, see the 
monograph by iHyvarinen. Karhunen and Ojal (120011 ) . In the ICA paradigm, one observes a 
random vector X G R'^ that can be expressed as a non-singular linear transformation of d 
mutually independent latent factors Si, . . . , Sd] thus X = AS where S = {Si, . . . , SdY and 
A is a dx d full rank matrix often referred to as the mixing matrix. As such, ICA postulates 
the following model for the probability distribution P of X: for any Borel set B in M'^, 

d 

P{B) = l[P,iwjB), 

i=i 

where W = (wi, . . . , WdY = is the so-called unmixing matrix, and Pi, . . . ,Pd are the 
univariate probability distributions of the latent factors Si, . . . ,Sd respectively. 

The goal of ICA, as in other blind source separation problems, is to infer from a sample 
xi, . . . , x„ of independent observations of X, the independent factors Si = IVxi, . . . , s„ = 
W^ni or equivalently the unmixing matrix W . This task is typically accomplished by first 
postulating a certain parametric family for the marginal probability distributions Pi, ... , Pd, 
and then optimising a contrast function involving {W,Pi, . . . ,Pd)- The contrast functions 
are often chosen to represent the mutual information as measured by KuUback-Leibler di- 
vergence or maximum entropy; or non-Gaussianity as measured by kurtosis or negentropy. 



Alternatively, in recent years, methods fo r ICA have also 



J-L. 



Pd have smooth (log) de i isities , e.g. 



Bach and Jordan 



(120031 ) , ISamarov and Tsybakovi ( 120041 ) and IChen and Bickell ( 12006! ) . Although more flexible 



j een d e veloped which assume 

mm 



Hastie and Tibshirani 



than their aforementioned parametric peers, there remain unsettling questions about what 
happens if the smoothness assumptions on the marginal densities are violated, which may 
occur, in particular, when some of the marginal probability distributions Pi,...,Pd have 
atoms. Another issue is that, in common with most other smoothing methods, a choice of 
tuning parameters is required to balance the fidelity to the observed data and the smooth- 
ness of the estimated marginal densities, and it is notoriously difficult to select these tuning 
parameters appropriately in practice. 

In this paper, we argue that these assumptions and tuning parameters are unnecessary. 
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and propose a new paradigm for ICA, based on the notion of nonparametric maximum 
likelihood, that is free of these burdens. In fact, we show that the usual nonparametric 
(empirical) likelihood approach does not work in this context, and instead we proceed under 
the working assumption that the marginal distributions of 5*1, . . . , 5*^ are log-concave. More 
specifically, we propose to estimate W by maximising 

log I det I + - ^ ^ log fj {wj^i) 

^ i=i j=i 

over all d X d non-singular matrices W = {wi, . . . , Wd)'^ , and univariate log-concave densities 
fi, . . . , fd- Remarkably, from the point of view of estimating the unmixing matrix W, it turns 
out that it makes no difference whether or not this hypothesis of log-concavity is correctly 
specified. 

The key to understanding how our approach works is to study what we call the log- 
concave ICA projection of a distribution on M*^ onto the set of densities that satisfy the ICA 
model with log-concave marginals. In Section 1271] below, we define this projection carefully, 
and give necessary and sufficient conditions for it to make sense. In Section \2.2\ we prove 
that the log-concave projection of a distribution from the ICA model preserves both the 
ICA structure and the unmixing matrix. Finally, in Section 12. 3[ we derive a continuity 
property of log-concave ICA projections, which turns out to be important for understanding 
the theoretical properties of our ICA procedure. 

Our ICA estimating procedure uses the log-concave ICA projection of the empirical 
distribution of the data, and is studied in Section [3l After explaining why the usual empirical 
likelihood approach cannot be used, we prove the consistency of our method. We also 
present an iterative algorithm for the computation of our estimator. Our simulation studies 
in Section H] confirm our theoretical results and show that the proposed method compares 
favourably with existing methods. 

2 Log-concave ICA projections 

Our proposed nonparametric maximum likelihood estimator can be viewed as the projection 
of the empirical distribution of xi,...,x„ onto the space of ICA distributions with log- 
concave densities. To understand its behavior, it is useful to study the properties of such 
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projections in general. 



2.1 Notation and overview 

Let Vk be the set of probability distributions P on M'^ satisfying ||x|| dP{x) < oo and 
P{H) < 1 for all hyperplanes if, i.e. the probability measures in that have finite mean 
and are not supported in a translate of a lower dimensional linear subspace of M'^. Here and 
throughout, || ■ || denotes the Euclidean norm on R^, and we will be interested in the cases 
k = 1 and k = d. Further, let W denote the set of non-singular d x d real matrices. We use 
upper case letters to denote matrices in W, and the corresponding lower case letters with 
subscripts to denote rows: thus wj is the jth row of W & W. Let Bk denote the class of 
Borel sets on M*"'. Then the ICA model is defined to be the set of P GVd of the form 

d 

P{B) = l[P,{wjB), \fBeBd, (1) 



for som e TV G W and Pi, . . . , Pd E Vi. As shown by iDiimbgen. Samworth and Schuhmacher 



(12011 



Theorem 2.2), the condition P G Vd is necessary and sufficient for the existence 
of a unique upper semi- continuous and log-concave density that is the closest to P in the 
Kullback-Leibler sense. More precisely, let J-^ denote the class of all upper semi-continuous, 
log-concave densities with respect to Lebesgue measure on M.^. Then the projection tp* : 
Vd J^d given by 

ijj* (P) = aigmax / log f dP 

is well-defined and surjective. In what follows, we refer to ip* as the log-concave projection 
operator and /* := ip*{P) as the log-concave projection of P. By a slight abuse of notation, 
we also use ip* to denote the log-concave projection from Vi to J^i. 

Although the log-concave projection operator does play a role in this paper, our main 
interest is in a different projection, onto the subset of J-^ consisting of those densities that 
satisfy the ICA model. This class is given by 

J-iCA _ I J ^ jr^ . j^^) _ I -Q f.^^-jx) dx for some W e W and fu fd e J^i\ . 

^ i=i ^ 

(2) 

Note that, in this representation, if X has density / G J^f"^, then wJX has density fj. The 
corresponding log-concave ICA projection operator is defined for any distribution P 



on R'^ by 

(p) = argmax / log fdP. 
We also write L** (P) = supj^jricA J^^ log / dP. 

Proposition 1. 1. If J^^ \\x\\ dP{x) = oo, then L**{P) = -oo and i)**{P) = Ff^. 

2. If Jjgd ||a;|| dP{x) < oo, but P{II) = 1 for some hyperplane H , then L**{P) = oo and 
iIj**{P) = 0. 

3. If P E Vd, then L**{P) G R and ip**{P) defines a non-empty, proper subset of J-'^^ . 

In view of Proposition [H and to avoid lengthy discussion of trivial exceptional cases, 
we henceforth consider as being defined on Vd- In contrast to ip*{P), which defines 

a unique element of J-^, the log-concave ICA projection operator ip**{P) may not define a 
unique element of J^d"^, even for P G Vd- For instance, consider the situation where P is 
the uniform distribution on the closed unit disk in equipped with the Euclidean norm. 
Here, the spherical symmetry means that the choice of G W is arbitrary. In fact, after a 
straightforward calculation, it can be shown that ip**{P) consists of those / G J^d"^ where, 
in the representation ([2]), ly G W is arbitrary and /i, /2 G J^i are given by fi{x) = /2(x) = 
|(1 — 1]}. It is certainly possible to make different choices of W that yield 

different elements of J^d"^- This example shows that, in general, we must think of %l)**[P) 
as defining a subset of J^d^^- 

The relationship between the spaces introduced above and the projection operators is 
illustrated in the diagram below: 

Vd A :Fd 
\ 

V'**LlCA 

ICA 4 ttICA 



Our next subsection studies the restriction of ip** to P^^^, denoted ?/'**|picA; Section [52] 
examines if)** more generally as a map on Vd- 

2.2 Log-concave projections of the ICA model 

Our first result in this subsection characterises ?/'**|picA. 
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Theorem 2. If P G Vf^^, themp**{P) defines a unique element ofj'f^^. The map ip**\'picA 
is surjective, and coincides with ip*\'picA. Moreover, suppose that P E V^^, so that 

d 

P{B) = l[Pj{wjB), \/BeBd, 
i=i 

for some W eW and Pi, . . . , Pd E Vi. Then /** = il)**{P) can he written as 



f**{x) = \detW\\{f*{w] 



X] 



where f* = ip*{Pj)- 



It is interesting to observe that log-concave projection operator ip* preserves the ICA 
structure. But perhaps the most important aspect of this result is the fact that the same 
unmixing matrix W can be used to represent both the original ICA model and its log-concave 
projection. This observation lies at the heart of the rationale for our approach to ICA. 

A remaining concern is that the unmixing matrix may not be identifiable. For instance, 
applying the same permutation to the rows of W and the vector of marginal distributions 
(Pi, . . . , Pfi) leaves the distribution unchanged; similarly, the same effect occurs if we multiply 
any of the rows of by a scaling factor and applying the corresponding scaling factor to 



the relevant r nargina. 



distrib ution. The question of identifiability for ICA models was first 



addressed by IComonI (Il994r). who assumed th at W is orthogonal, and was settled in the 



general case by Eriksson and KoivunenI (120041 ). One way to state their result is as follows: 
suppose that a probability measure P on has two representations as 

d d 



(3) 



where W , G W and Pi, ... , P^, Pi, . . . , P^ are probability measures on M. Then the pair 
of conditions that Pi, . . . , P^ are not Dirac point masses and not more than one of Pi, . . . , P^ 
is Gaussian is necessary and sufficient for the existence of a permutation vr of {1, . . . , c?} and 
scaling vector e = (ei, . . . , e^) G (M \ {O})"^ such that Pj{Bj) = PT,(^j){ejBj) for all Bj G Bi, 
and Wj = eJ'^WT^(jy When such a permutation and scaling factor exist for any two ICA 
representations of P, we say that the ICA representation of P is identifiable, or simply that 
P is identifiable. By analogy, we define / G ^^""^ to be identifiable if not more than one of 
/i, . . . , /rf in the representation ([2]) is Gaussian. 
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Our next result shows that ip** preserves the identifiabihty of the ICA model. Together 
with Theorem |2l we see that if P G Vj["^ is identifiable, then the unmixing matrices of 
P e rf^ and i)**{P) are identical up to the permutation and scaling transformations 
described above. 

Theorem 3. Let P G "Pj*^^- Then ip**{P) is identifiable if and only if P is identifiable. 
2.3 General log-concave ICA projections 

We now consider the general log-concave ICA projection ip** defined on Vd- Define the 
Mallows distance d (also known as the Wasserstein distance) between probability measures 
P and Q on M'^ with finite mean by 

d{P,P)= inf E||X-r||, 

(X,X)~(P,P) 

where the infimum is taken over all pairs {X, Y) of random vectors X ~ P and X ~ P on 
a common probability space. Recall that rf(P", P) ^ if and only if both P" A P and 
/jgd dP"-{x) — > Jjgd dP{x). We are interested in the continuity of ip**- 

Proposition 4. Let P, P^, P^, . . . be probability measures in Vd with d{P"', P) — )■ as n ^ 
oo. Then L**{P'') L**{P). Moreover, 

sup inf / I/" - /I -> 

as n ^ oo. 

The second part of this proposition says that any element of ?/'**(P") is arbitrarily close in 
total variation distance to some element of ip** (P) once n is sufficiently large. In the special 
case where ip**{P) consists of only a single element, we can say more. It is convenient to let 

ICA 

Ud denote the set of permutations of {!,..., d}, and write (W, fi, . . . , fd) ~ f if W E W 
and /i, . . . , /d G J-*! can be used to give an ICA representation of / G J^d^^ ©• Similarly, 
we write {W, Pi, ... , Pd) PifW eW and Pi, . . . , P^ G Pi represent P G Pf^ in (JH). 

Theorem 5. Suppose that P G Vf^^, and write f** = ip**{P). If P^,P^, ... eVd are such 
that d^P"", P) 0, then 

sup / |/--/-|^0. 
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Suppose further that P is identifiable and that (W, Pi, ... , Pd) P- Then 
sup sup inf inf {Wi^'jY^w'^^ij-) - 



for each j = 1, . . . ,d, where f* = il)*{Pj). As a consequence, for sufficiently large n, every 
fn g ^**(pn) ^5 identifiable. 

The first part of Tlieorem [5] sliow that if P G Pj^^ and P E Pd are close in Mallows 
distance, then every / G ip**{P) is close to the corresponding (unique) log-concave ICA 
projection / = il)**{P) in total variation distance. The second part shows further that if P 
is identifiable, then up to permutation and scaling, every / G il)**{P) and every choice of 
unmixing matrix W and marginal densities /i, . . . , in the ICA representation of / is close 
to the unmixing matrix W and marginal densities /i, . . . , in the ICA representation of /. 

To conclude this subsection, we remark that, by analogy with the situation when P G 
V^^^ described in Theorem |2l if P G and X ~ P, any /** G ip**{P) can be written as 

d 

f**{x) = \detW\l[f;{wJx), 

i=i 

for some W G W, where f* = ip*{Pj), and Pj is the marginal distribution of wJX. This 
observation reduces the maximisation problem involved in computing ip**{P) to a finite- 
dimensional one (over W G W), and follows because 

sup / log fdP= sup sup < log I det H^l + > / log fj{w'^ x) dP{x)> 
= sup <^ log I det iy| + / logfUwJx) dP{x) \. 

3 Nonparametric maximum likelihood estimation for 
ICA models 

We are now in position to study the proposed nonparametric maximum likelihood estimator. 
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3.1 Estimating procedure and theoretical properties 

Now assume xi , X2 , . . . are independent copies of a random vector X G M'' satisfying the 
ICA model. Thus X = AS, where A = G W and S = {Si, . . . , S'^)"'' has independent 
components. In this section, we study a nonparametric maximum hkehhood estimator of W 
and the marginal distributions Pi, . . . , P^^ of Si, . . . , Sd based on xi, . . . , x„, where n > d+1. 
We start by noting that the usual nonparamet ric maximurn likelihood estimate does not 



work. Indeed, in the spirit of empirical likelihood (jOwenl . Il990l ). it would suffice to consider, 
for a given W = {wi, . . . , Wd)""" G W, estimates Pj of the marginal distribution Pj, supported 
on wjxi, . . . , wj:s.n- This leads to the nonparametric likelihood 

n d 

L{W,Pi,...,P,) = \[\[p,„ (4) 

i=i i=i 

where pij = Pj(wjxj). Let J denote a subset of {d + 1) distinct indices in {1, ... , n}, and let 
Xj denote the d x [d + 1) matrix obtained by extracting the columns of X = (xi, . . . , x„) 
with indices in J. Now let X(_j) denote the d x d matrix obtained by removing the jth 
column of Xj. Let Wj G W have jth row Wj = (X^^^)"'"!^, for j = 1, . . . , d, where Id is a 
(i-vector of ones. Our next result shows that every Wj corresponds to a maximiser of the 
nonparametric likelihood (jl]). 

Proposition 6. Suppose that xi, . . . ,x„ are in general position. Then for any choice J of 
{d + 1) distinct indices in {1, ... , n}, there exist Pi, . . . , P^, E Vi such that {Wj, Pi, ... , P^) 
maximises L{-). 

If X has a density with respect to Lebesgue measure on R"', then with probability 1, 
every subset of xi, . . . , x^ of size {d + 1) is in general position. On the other hand, there is 
no reason for different choices of J to yield similar estimates Wj, so we cannot hope for such 
an empirical likelihood-based procedure to be consistent. 

As a remedy, we propose to estimate P° G V^'"^ by ■?/'** (P"), where P" denotes the 
empirical distribution of xi, . . . , x„ ~ P*^. More explicitly, we estimate the unmixing matrix 
and the marginals by maximising the log-likelihood 

^ n d 

fi,..., fd) = /i, ...,/,; xi, xO = log \detW\ + -Y,Yl /^K^*) (5) 

^ i=i j=i 
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over W E W and /i, . . . , /d G J^i. Note from Proposition [T] that ip**{P"') exists as a proper 
subset of J-'^^^ once the convex hull of xi,...,x„ is (i-dimensional, which happens with 
probability 1 for sufficiently large n. As a direct consequence of Theorem [5] and the fact that 
d{P"', P") 0, we have the following consistency result. 

Corollary 7. Suppose that E V^'"^ is identifiable and is represented by W'^ G W and 
Pi°, ...,P^eVi. Then for any maximiser (W^", , . . . , /^) o/r(l^, /i, . . . , fa) over W eW 
and /i, • • • , /d G -Fi, there exist a permutation vr" of {1, ... ,d} and scaling factors e", . . . , G 
M \ {0} such that 

/oo 
||e7|/r^„(,.)(e7x)-/;(x)|rfxn-0, 
-oo 

for j = 1, . . . ,d, where f* = ip*{Pj) ■ 



3.2 Pre-whitening 

Pre-w hitening is a stan dard pre-processi n g tech nique in the ICA literature; see 



2001 



Hyvarinen. Karhunen and O 



pp. 140-141) or IChen and Bickell (l2005l ). In this subsection, we explain the rationale 



for pre-whitening and the simplifications it provides. 

Assume for now that P E Vf^ and dP{x) < oo, and let E denote the (positive- 

definite) covariance matrix corresponding to P. Consider the ICA model X = AS, where 
X ~ P, the mixing matrix A is non-singular and S = {Si, . . . , Sd) has independent com- 
ponents with Sj ~ Pj. Assuming without loss of generality that each component of S 
has unit variance, we can write T,~^^'^X = T.^^^'^AS = AS, say, where A belongs to the 
set 0{d) of orthogonal d x d matrices. Thus the unmixing matrix W belongs to the set 
0(d)S-i/2 = {OS-i/2 : O E 0{d)}. 

It follows that, if S were known, we could maximise with the restriction that W E 
0((i)S~^/^. In practice, E is typically unknown, but we can estimate it using the sample 
covariance matrix S. For n large enough that the convex hull of Xi, . . . , x„ is (i-dimensional, 
we can therefore consider maximising 

r(l^,/i,...,/,;xi,...,xO 

over W E 0((i)S~^/^ and fi,...,fd G J^i. Denote such a maximiser by (W"', /", . . . , /^). 
The corollary below shows that, under a second moment condition, W"' and . . . , f2 have 
the same asymptotic properties as the original estimators VT" and /",..., 
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Corollary 8. Suppose that P° G is identifiable, is represented by W'^ G W and 

Pi, . . . , G Pi and that J^^ W^W^ dP^{x) < oo. Then with probability 1 for sufficiently large 
n, a maximiser (W^", /f, ■ ■ ■ , k) of e{W, /i, . . . , fa) over W G 0(ci)S-V2 /i, g 
J^i exists. Moreover, for any such maximiser, there exist a permutation tt" of {1, ... ,d} and 
scaling factors , . . . , G M \ {0} such that 

/oo 
-oo 

where f* = ^*(P°). 

An alternative, equivalent way of computing {W'^, /f , . . . , /^) is to pre-whiten the data 
by replacing Xi, . . . , x„ with Zi = E'^^/^Xi, . . . , z„ = E'"-'^/^x„, and then maximise 

r{0,gi,...,gd;zi,...,Zn) 

over O G 0((i) and gi, ■ ■ ■ , gd ^ J^i- If (O", . . . , ^2^) is such a maximiser, we can then set 
W"' — 0"E~^/^ and = Note that pre- whitening breaks down the estimation of the d^ 
parameters in W into two stages: first, we use E to estimate the d{d-\-l)/2 free parameters of 
the symmetric, positive definite matrix E, leaving only the maximisation over the d{d— l)/2 
free parameters of O G 0{d) at the second stage. The advantage of this approach is that 
it facilitates more stable maximisation algorithms, such as the one described in the next 
subsection. 

3.3 Computational algorithm 

In this subsection, we address the challenge of maximising 

r(W^,yi,...,c/d;zi,...,z„) 

over W G 0{d) and gi, . . . ,gd G J-'i. As a starting point, we choose W to be randomly 
distributed according to Haar measure on the set 0{d) oi d x d orthogonal matrices. A 
simple way of generating W with this distribution is to generate a d x d matrix Z whose 
entries are independent A^(0, 1) random variables, compute the Qi^-factorisation Z — QR, 
and let W ^Q. 

Our proposed algorithm then alternates between maximising the log-likelihood over 
/i, . . . , /d for fixed W, and then over W for fixed /i, . . . , J^. The first of these steps is 
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straightforward given Theorem [2] and the recent work on log-concave density estimation: we 
set /,■ to be the log-concave maximum likelihood estimato r of the data wJyi^ , . . . , wjx„ . 



This can be computed using the R package logcondens ( jRufibach and Diimbgen . 



2006 



Diimbgen and Rufibachl . l201ll ). 



This leaves the challenge of updating W & 0{d). In order to describe our proposal, we 
recall some basic facts from differential geometry. The set 0{d) is a d{d — l)/2-dimensional 
submanifold of R'^\ The tangent space at W e 0{d) is TwO{d) := {WY : Y = -Y'^}. In 
fact, if we define the natural inner product (•, ■) on TwO{d) x TwO{d) by {U, V) = tr(f/V^"'"), 
then 0{d) becomes a Riemannian manifold. (Note that if we think of U and V as vectors in 

, then this inner product is simply the Euclidean inner product.) 

There is no loss of generality in assuming W belongs to the Riemannian manifold SO{d), 
the set of special orthogonal matrices having determinant 1. We can now define geodesies 
on SO{d), recalling that the matrix exponential is given by 

exp(r) = / + 5^-. 

r=l 

The unique geodesic passing through W G SO{d) with tangent vector WY (where Y = —Y'^) 
is the map a : [0, 1] SO{d) given by a(t) = Wexp{tY). 

We update W by moving along a geodesic in SO{d), but need to choose an appropriate 
skew-symmetric matrix Y, which ideally should (at least locally) give a large increase in the 
log-likelihood. The key to finding such a direction is Proposition |9] below. To set the scene 
for this result, observe that for x G [min(wjxi, . . . , wjx„), max(wjxi, . . . ,toJx„)], we can 
write 

log fj{x)= min {bjkX-/3jk), (6) 

fc— 1,111, m j 



for some hjk.Pjk ^ 1^ (e.g. ICule. Samworth and Stewartl . l2010l ). Since we may assume that 



bji, . . . ,bjmj are strictly decreasing, the minimum in (Q is attained in either one or two 
indices. It is convenient to let )Cij = argmmj^^i^ .{bjkwjyii — f3jk)- 



Proposition 9. Consider the map g : SO{d) — )■ M given by 

^71 d 

9{W) = - min (fojfcwjx, - f3jk)- 

n •^"^ •^"^ fe=l,...,m, 

i=l j=l 

Let Y be a skew-symmetric matrix and let cj denote the jth row of WY . If |/Cjj| = 1, let 
kij denote the unique element of )Cij. If |/Cjj| = 2, write /C^ = {%i,fcjj2}- cjxj > 0, let 
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kij = kiji, where I = argmin^^^g ^fc,,; / ^/ cjxj < 0, let = hji, where I = argmax;^;L,2 ^fc^^r 
Then the one-sided directional derivative of g at W in the direction WY is 

^ n d 

VwY9{W) := - J] J]6,fc,^cjx,. 

i=i j=i 

For 1 < s < r < d, let Yr^s denote the dx d matrix with Yr^s{r,s) = l/\/2, Fr-,s(s,r) = 
— and all other entries equal to zero. Then 3^"*" = {l^r,s : 1 < s < r < rf} forms an 

orthonormal basis for the set of skew-symmetric matrices. Let = {—Y : Y G y^}- We 
choose y™^'^ G y^ U y^ to maximise VwYQiY). 

We therefore update W with exp(ey™^^), and it remains to select e. This we propose 
to choose by means of a backtracking line search. Specifically, we fix « G (0, 1) and e = 1, 
and if 

^(iyexp(er"'^")) > g{W) + 

we accept a move from W to W exp{eY^^^) . Otherwise, we successively reduce e by a factor 
of 7 G (0, 1) until ([7]) is satisfied, and then move to W exp{eY'^^^) . In our implementation, 
we used a = 0.3 and 7 = 1/2. 

Our algorithm produces a sequence {W'^-^\ f[^\ . . . , f^), {W^'^\ ff \ . . . , ff^), We 

terminate the algorithm once 

r(w^w, ft\ . . . , /f ) - nw^'-'\ ft'\ • • • , ft'^) ^ 

where, in our implementation, we chose rj = 10~^. As with other ICA algorithms, global 
convergence is not guaranteed, so we used 10 random starting points and took the solution 
with the highest log-likelihood. 



4 Numerical Experiments 

To illustrate the practical merits of our proposed nonparametric maximum likelihood esti- 
mation method for ICA models, we conducted several sets of numerical experiments. To fix 
ideas, we focus on two-dimensional signals, that is d = 2. The components of the signal were 
generated independently, and then rotated by tt/3, so the mixing matrix is 

A=( '^''^ 
1^^3/2 1/2 

13 



Our goal is to reconstruct the signal and estimate A, or equivalently W = A~^, based on 
n = 200 observations of the rotated input. 

We first consider a typical example in the ICA literature where the density of each com- 
ponent of the true signal is uniform on the interval [—0.5, 0.5]. The top left panel of Figure [H 
plots the 200 simulated signal pairs, while the top right panel gives the rotated observa- 
tions. The bottom left panel plots the recovered signal using the proposed nonparametric 
maximum likelihood method. Also included in the bottom right panel of the figure are the 
estimated marginal densities of the two sources of signal. 



=f - 




Figure 1: Uniform signal: Top left panel, top right panel and bottom left panel give the true 
signal, rotated observations and the reconstructed signal respectively. The bottom right 
panel gives the estimated marginal densities along with the true marginal (grey line). 



Figure [2] gives corresponding plots when the marginals have an Exp(l) — 1 distribution. 
We note that both uniform and exponential distributions have log-concave densities and 
therefore our method not only recovers the mixing matrix but also accurately estimates the 
marginal densities, as can be seen in Figures [U and El 

To investigate the robustness of the proposed method when the marginal components do 
not have log-concave densities, we repeated the simulation in two other cases, with the true 
signal simulated firstly from a t- distribution with two degrees of freedom scaled by a factor 
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Figure 2: Exponential signal: Top left panel, top right panel and bottom left panel give the 
true signal, rotated observations and the reconstructed signal respectively. The bottom right 
panel gives the estimated marginal densities along with the true marginal (grey line). 



of l/\/2 and secondly from a mixture of normals distribution 0.7A^(— 0.9, 1) + 0.3A^(2.1, 1). 
Figures [3] and H] show that, in both cases, the misspecification of the marginals does not 
affect the recovery of the signal. Also, the estimated marginals represent estimates of the 
log-concave projection of the true marginals (a standard Laplace density in this case), as 
correctly predicted by our theoretical results. 

As discussed before, one of the unique advantages of the proposed method over existing 
ones is its general applicability. For example, the method can be used even when the marginal 
distributions of the true signal do not have densities. To demonstrate this property, we 
now consider simulating signals from a Bin(3,l/2) — 1.5 distribution. To the best of our 
knowledge, none of the existing ICA methods are applicable for these types of problems. 
The simulation results presented in Figure |5] suggest that the method works very well in this 
case. 

To further conduct a comparative study, we repeated each of the previous simulations 200 
times and computed our estimate along with those produced by the FastICA and ProDenICA 
methods. FastICA is a popular parametric ICA method; ProDenICA is a nonparametric ICA 
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Figure 3: t2 signal: Top left panel, top right panel and bottom left panel give the true signal, 
rotated observations and the reconstructed signal respectively. The bottom right panel gives 
the estimated marginal densities along with the true marginal (grey line). 



method proposed by lHastie and Tibshiranil (120031 ) . and has been shown to enjoy the best per- 



form ance among a large collection of existing ICA methods (IHastie. Tibshirani and Friedman . 



20091 ) . Both the FastICA and ProDenIC A methods were implemented using the R package 



ProDenICA (IHastie and Tibshirani 120101). To compare the performance of these methods, 
we follow convention (IHyvarinen. Karhunen and Ojal . 1200 ll ) and compute the Amari metric 
between the true unmixing matrix W and its estimates. The Amari metric between two 
d X d matrices is defined as 



M{A,B) 



1 



2d \^maxi<j<d \Cij 



1 + 



2d 





C 






maxi<i<rf 


Cij\ 



1 



where C = {Cij)i<i,j<d = AB^^. Boxplots of the Amari metric for all three methods are 
given in Figure El 

It is clear that both nonparametric methods outperform the parametric method. Several 
further observations can also be made on the comparison between the two nonparametric 
methods. For both uniform and exponential marginals, the proposed method improves upon 
ProDenICA. This might be expected since both distributions have log-concave densities. It 
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Figure 4: Mixture of normals signal: Top left panel, top right panel and bottom left panel 
give the true signal, rotated observations and the reconstructed signal respectively. The 
bottom right panel gives the estimated marginal densities along with the true marginal 
(grey line). 



17 



Truth 



Rotated 




-6 -4 -2 O 2 4 6 -1.5 -1.0 -0.5 O.O 0.5 1.0 1.5 

Si s 



Figure 5: Binomial signal: Top left panel, top right panel and bottom left panel give the 
true signal, rotated observations and the reconstructed signal respectively. The bottom right 
panel gives the estimated marginal densities. 





Figure 6: Comparison between LogConICA, FastICA and ProDenlCA. 
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is, however, interesting to note the robustness of the proposed method on the marginals as 
it still outperforms ProDenICA for ^2 marginals, and remains competitive for the mixture 
of normal marginals. The most significant advantage of the proposed method, however, 
is displayed when the marginals are binomial. Recall that ProDenICA, and perhaps all 
existing nonparametric methods, assume that the log density (or density itself) is smooth. 
This assumption is not satisfied with the binomial distribution and as a result, ProDenICA 
performs rather poorly. In contrast, our proposed method works fairly well in this setting 
even though the true marginal does not have a log-concave density with respect to Lebesgue 
measure. All these observations confirm our earlier theoretical development. 

5 Proofs 

Proof of Proposition [U 

1. Suppose that f^^ \\x\\ dP{x) = oo. Fix an arbitrary / G J^l^^y find a > and /3 G M 
such that f{x) < e-"ll^'ll+^. Then 



Thus L**{P) = -oo and iij**{P) = Ff^. 

2. Now suppose that Jj^^ ||x|| dP{x) < oo, but P{H) = 1 for some hyperplane H = 
{x G M'^ : ajx = a}, where ai is a unit vector in and a G M. Find 02, . . . , such that 
ai, . . . , Orf is an orthonormal basis for W^. Define the family of density functions 





2 



Then U e J^f^, and 




log(cr) — d\og2 





as cr — )■ 0. 
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3. Now suppose that P E Vd- Notice that the density f{x) = 2 '^11^=16 belongs to 



J-'}p^ and satisfies 



[ log f dP = - [ \xj\dP{x) -d\og2 



> — oo. 



Moreover, 



sup / log / dP < sup / log / dP < oo 



where the second inequality follows from the proof of Theorem 2.2 of lDiimbgen. Samworth and Schuhmachei 



(120 111 ). We may therefore take a sequence p, . . . E J^\^^ such that 



log rdP/" sup / log fdP. 



Let csupp(P) denote the convex support of P; that is, the intersection of all closed, con- 
vex sets having P-measure 1. Followi n g the arguments in the proof of Theorem 2.2 of 
Diimbgen. Samworth and Schuhmacherl (120 111 ), there exist a > and /3 G M such that 
sup„gj^/"(x) < e ""!!^!!"*"^ for all x E W^. M oreover, these arguments (see also the proof 



of Theorem 4 of ICule and SamworthI ( l2010l )) yield the existence of a closed, convex set 
C ^ int(csupp(P)), a log-concave density /** E Td with {x G M'' : /**(x) > 0} = C and a 
subsequence (/"'') such that 

r*{x) = Urn r^{x) for all x E int(C) U (R"^ \ C). 

k—>-oo 

Since the boundary of C has zero Lebesgue measure, we deduce from Fatou's lemma applied 
to the non-negative functions x t— )■ e~°"^'"'^^ — /"'=(x) that 



/ 



log /** dP > lim sup / log dP = sup 



log fdP 



It remains to show that /** E J^]p^- We can write 



n{x) = \deiW^\\{f^{{w]yx), 

where E W and E T\ for each A; G N and j = 1, . . . , c?. Let be a random vector 
with density G J^l^^, and let X be a random vector with density /** G J-^. We know 
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that X'' -4 X as A; — )■ oo, and that {w'^yx'', . . . , {w'^yx^ are independent for each k. Let 
Wj = Wj/\\Wj\\ and fj{x) = \\wj\\fj{\\wj\\x). Then we have 

d 

n{x) = \detW'\l[f^{{wfyx), (8) 
i=i 

where the matrix has jth row w^. Moreover, G W and f^, . . . , G J-'i, so ([S]) 
provides an alternative, equivalent representation of the density f^'', in which each row 
of the unmixing matrix has unit Euclidean length. By reducing to a further subsequence if 
necessary, we may assume that for each j = 1, . . . ,d, there exists Wj G M*^ such that Wj — )■ Wj 
as /c — )> oo. By Slutsky's theorem, it then follows that 

{{w',yx\ . . . , iw'.yx') 4 {wJX, . . . ,^JX). 

Thus, for any t G M'', 



i=i i=i 



We conclude that w^^; • • • 5 ''^I^ independent. Since = 1 for all j, we deduce 

further that W = {wi, . . . , Wrf)""" is non- singular . M oreover , each of wJX, . . . , wJX has a log- 



concave density, by Theorem 6 of iPrekopal (jl973l ). This shows that /** G J^^^^y required. 
□ 

Proof of Theorem [2] 
Suppose that P G satisfies 

d 

PiB) = l[P,{wjB) 
i=i 

for some TV G W and Pi , . . . , G Pi . Consider maximising 

log f{x) dP{x) 

over / G J^d- Letting s = Wx and f{s) = f{As), where A = W~^, we can equivalently 
maximise 



l^hgf{s)d(^(^P,{s,)^ 



i=i 
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over / G J^d- But, by Theorem 4 of IChen and SamworthI (120121 ). the unique solution to this 
maximisation problem is to choose f{z) = Y['j=i fji^j)^ where f* = ip*{Pj). This shows that 
/* := ip*{P) can be written as 



r(x) = |detiyin/ 



1 v^J^) 



Since /* G J^^^"^ slso, we deduce that /* is also the unique maximiser of /jg^ log / dP over 

f eJ^f^,so%l)**{P) = %lj*{P). □ 
Proof of Theorem [3] 

Suppose that P G V^^^. Let X ~ P, so there exists W such that WX has independent 
components. Writing Pj for the marginal distr ibution of wJX, note that Pi, . . . ,Pd G Vi. 
By Theorem [2] and the identifiability result of Eriksson and Koivuneru (j2004j ) , it therefore 
suffices to show that Pj G Vi has a Gaussian density if and only if il)*{Pj) is a Gaussian 
density. If Pj has a Gaussian density /*, then since /* is log-concave, we have /* = ■ip*{Pj). 
Conversely, suppose that Pj do es not have a Gaussian density. Since f * = il '*(Pj) satisfies 
/^2;(iPj(x) = J^^xfj{x)dx (jDiimbgen. Samworth and Schuhmachen . 120111 Remark 2.3), 
we may assume without loss of generality that Pj and f* have mean zero. We consider 
maximising 

POD 

log fdP, 

) 

for the mean zero Gaussian density 



over all mean zero Gaussian densities /. Writing 
with variance o"^, we have 



/OO /»00 -j^ 

log 0,2 dPj = -^J dPj{x) - - log(27ra2). 



'his e xpression is maximised uniquely in o"^ at = dPj{x). But 



Chen and Samworth 



(I2OI2I ) show that the only way a distribution Pj and its log-concave projection il)*{Pj) can 



have the same second moment is if Pj has a log-concave density, in which case Pj has density 
4'*{Pj)- We therefore conclude that the only way ip*{Pj) can be a Gaussian density is if Pj 
has a Gaussian density, a contradiction. □ 
Proof of Proposition H] 



'he p roof of this proposition is very similar to the proof of Theorem 4.5 of lDiimbgen. Samworth and 



( 120111 ). so we only sketch the argument here. For each n G N, let /" G iIj**{P^), and consider 
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an arbitrary subsequence (/"'')• By reducing to a further subsequence if necessary, we may 
assume that L**(P"'=) -> A G [-00, oo]. Observe that 

A > hm / log(2-V^j=il'=^l) dP"'{x) = -rflog2 - V / \xj\ dP{x) > -00. 

Arguments from convex analysis can be used to show that the sequence (/"'-') is uniformly 
bounded above, and liminffcgN f^'^{xo) > —00 for all Xq € int(csupp(P)). From this it follows 
that there exist a > and 6 G M such that sup^gj^ sup^gj^d /"''(x) < — + b. Thus, by 
reducing to a further subsequence if necessary, we may assume there exists /** G J-'d such 
that 

limsup r^ix) = f**{xo) for all Xq G \ d{x G R'^ : f**ix) > 0} (9) 
limsup r^ix) < r*{xo) for all xo G d{x G M'^ : f**{x) > 0}. 



Note from this that 



lim / log dP'"= < -a [ \\x\\dP{x) + b 



A = lim / log dP'"' < -a I \\x\\ dP(x) + b<oo. 

k- 

In fact, we can use the argument from the proof of Proposition [1] to deduce that /** G J^^^^- 
Skorokhod's representation theorem and Fatou's lemma can then be used to show that 

\<j^Aogr*dP<L**iP). 

We can obtain the other bound A > L**{P) by taking any element of iIj**{P), approxi- 
m ating it from above using Lipschitz cont i nuous functions, as in the proof of Theorem 4.5 



of 



Diimbgen. Samworth and Schuhmacherl ( 2011 ). and using monotone convergence. From 



these arguments, we conclude that L**(P") — )■ L**{P) and /** G ip**{P). 

We can see from (E]) that /**, so J^^ IF' - f**\ ^ 0, by Scheffe's theorem. Thus, 

given any /" G ip**{P^) and any subsequence we can find /** G ip**{P) and a further 

subsequence of (/"'-■) which converges to /** in total variation distance. This yields the 
second part of the proposition. □ 
Proof of Theorem [5] 

The first part of the theorem is a special case of Proposition ID Now suppose P G Vf"^ 
is identifiable and is represented hj W E W and Pi, . . . , G Vi. Suppose without loss of 
generality that \\wj\\ = 1 for all j = 1, . . . ,d and let /** = il)**{P). Recall from Theorem [2] 
that if X has density /**, then lu-X has density f* = ip*{Pj). 
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Suppose for a contradiction tliat we can find e > 0, integers 1 < rii < n2 < ■ ■ ■, & 
^**(pnfe) and /f , . . . , f^) f such that 



inf inf inf 

fceN efcgR\{0} 7rfeGn„ 



Will + / ||e-|/;^fe(j)(e-a;) - f;{x)\ dxj > e. 

We can find a subsequence 1 < ki < k2 < . . . such that ti'^VII^j'll ~^ say, as / — t- oo, 
for all j = 1, . . . , d. The argument towards the end of the proof of Case 3 of Proposition [T] 



shows th at W can be used to represent the unmixing matrix of /**, so by the identifiability 

and the fact that \\wj\\ = 1, there exist ei, . . . , q G 



result of 



Eriksson and Koivunei 



ep resent i 

d (2QqJ 



{ — 1,1} and a permutation tt of {l,...,d} such that ejW.„(^j) = Wj. Setting vr" = vr and 



e 



we deduce that 



for j = l,...,d. Now observe that if X'^' has density Z'^', then by Slutsky's theorem, 



(ej') i,^^..))"'"^^' — ^J^- It therefore follows from Proposition 2(c) of 

mm that 

/oo 
-oo 

for j = 1, . . . ,d. This contradiction establishes that 



Cule and Samworth 



sup sup inf inf 



+ 



\e]\f:.^^^ie]x)~f;ix)\dx\-,0 



(10) 



for each j = 1, . . . ,d. 

It remains to prove that for sufficiently large n, every g ip**[P^) is identifiable. Recall 
from the identifiability result of Eriksson and KoivunenI (j2004( ) and Theorem [3] that not more 
than one of fl, ■ ■ ■ , fd is Gaussian. Let (p^^a^i-) denote the univariate normal density with 
mean n and variance cx^. Let J denote the index set of the non- Gaussian densities among 
/^*, . . . , so the cardinality of J is at least d — 1, and consider, for each j G J, the problem 
of minimising g{fi, a) = |0^i,o-2 — fj \ over /i G R and cr > 0. Observe that g is continuous 
with g{fi, a) < 2 for all fj, and a, that inf^gK (yf(/i, cr) 2 as cr 0, oo and info->o 5'(/i, a) — )• 2 
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as — )■ oo. It follows that g attains its infimum, and there exists 77 > such that 



00 



inf inf inf / |0^,2 -/*| > 7]. (11) 



Comparing ( fTOl) and ( ITTj) . we see that, for sufficiently large n, whenever /" G i/j**{P^) and 
(VT", . . . , /^) at most one of the densities , • • • , can be Gaussian. It follows 

that when n is large, every /" G ip**{P'^) is identifiable. □ 
Proof of Proposition [6] 

It is well-known that for fixed W G W, the nonparametric likelihood L(-) defined in (jlj) is 
maximised by choosing 



For i = 1 . . . , n, G yy and j = 1, . . . , d, let 

^«>,(0 = {i G {1, . . . : wjx- = tyjxi}. 



The binary relation z ~ z if n^.{i) = n^.{i) defines an equivalence relation on {1, . . . ,n}, so 
we can let denote a set of indices obtained by choosing one element from each equivalence 
class. Then 



L{W, P^, ...,P^) = f[ ^"""^ 1 1^"-. (2) I • • • Kj (^) I _ -Q ^-n -Q |„ 

i=i i=i ie/w' 



Since xi, . . . , x„ are in general position by hypothesis, we have that X]jG/w-(|?^«,j (i)| — 1) < 
d-1. It follows that L{W, , . . . , P^) < {d'^/rf'Y. Moreover, for any choice J of dis- 
tinct indices in {1, . . . , n} if we construct the matrix Wj G W as described just before the 
statement of Proposition El then LiWj, P^' , P^') = {d'^/rfY. □ 
Proof of Corollary [8] 

Let P"'^ denote the empirical distribution of zi = S'^^^xi, . . . , z„ = S^^^^x^. Writing 
z = n^^ XliLi ^« ^"^^ ^ ~ ''^^^ X]r=i ■'^i' note that the covariance matrix corresponding to P"'^ 
is 

i V(z, - z)(z, - z)"^ = - V S-^/^(x, - x)(x, - x)"^S-i/^ = /. 



— ' n 

i=l 1=1 



Observe further that there is a bijection between the set of maximisers (H^", . . . , f2) 
of /i,...,/d;xi,...,x„) over W G 0((i)S~^/2 and G J^i, and the set of 
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maximisers (6", §1,..., gfj of ^(0, 5-1, ... , Qd] Zi, . . . , z„) over O E 0{d) and gi, . . . , gd E J'l 
via the correspondence W^" = 0"'fl^^^^ and /" = g'j. 

It follows from the discussion in Section 13.21 that maximising £"(0, gi, gd] zi, ... , z„) 
over O G 0(c/) and gi, . . . ,gd G -Fi amounts to computing the log-concave ICA projection 
of P"'^. Existence of a maximiser therefore follows from Proposition [1] and the fact that the 
convex hull of Zi, . . . , z„ is dimensional with probability 1 for sufficiently large n. 

Now suppose O" and g'l, . . . ,g^ represent the log-concave ICA projection ?/^**(P"'^). Fur- 
ther, let P^'^ denote the distribution of S^^/^xi, so P°'^ G V^'"^ has identity covariance 
matrix and suppose (0°, P°■^ . . . , P°'^) P°'^ Then c/(P"'^ P°'^) as n ^ 00, so by 
Theorem[5l there exist a permutation vr" of {1, ... , c?} and scaling factors e", . . . , eJJ G ]R\ {0} 
such that 

/oo 
\\i]\gl^^^0]x)-g;ix)\dx''-^O, 
-00 



IS-. 



where ^* = Writing = O^S-^/^ IV" = 6"S-i/^ /j* = and noting that 

g* = tp*{Pj'^) = ilj*{Pj) = fj, the conclusion of the corollary follows immediately. □ 
Proof of Proposition M 

For e > 0, let = W exp{eY), and let w^^e denote the jth row of W^. Notice that 

as e \ 0. It follows that for sufficiently small e > 0, 
g{W,)-giW) 



-V'V'I min {bjkwj^^i- f3jk) - min {bjkwj:sii - f3jk)\ 

a '—^ ' ' l,fc=l,...,m, k=l,...,mj ) 

i=l j=l 

^ n d 



^ ^ - wjx 

j=i j=i 



i=i j=i 



as e \ 0. □ 
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